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Abstract Many estimation problems in astrophysics are highly complex, with high-dimensional, non-standard data 
objects (e.g., images, spectra, entire distributions, etc.) that are not amenable to formal statistical analysis. To utilize 
such data and make accurate inferences, it is crucial to transform the data into a simpler, reduced form. Spectral kernel 
methods are non-linear data transformation methods that efficiently reveal the underlying geometry of observable 
data. Here we focus on one particular technique: diffusion maps or more generally spectral connectivity analysis 
(SCA). We give examples of applications in astronomy; e.g., photometric redshift estimation, prototype selection for 
estimation of star formation history, and supernova light curve classification. We outline some computational and 
statistical challenges that remain, and we discuss some promising future directions for astronomy and data mining. 



1 Introduction 



The recent years have seen a rapid growth in the depth, richness, and scope of astronomical data. This trend is sure 
to accelerate with the next-generation all-sky surveys (e.g.. Dark Energy Survey (DES|^ Large Synoptic Survey Tele- 
scope (LSSTf] Panoramic Survey Telescope and Rapid Response System (PanSTARRSf] Visible and Infrared Survey 
Telescope for Astronomy (vistaQ, hence creating an ever increasing demand on sophisticated statistical methods 
that can draw fast and accurate inferences from large databases of high-dimensional data. From a data mining per- 
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spective, there are two general challenges one has to face. The first is the obvious computational challenge of rapidly 
processing and drawing inferences from massive data sets. The second is the statistical challenge of drawing accurate 
inferences from data that are high-dimensional and/or noisy. 

Many of the estimation problems in astronomical databases are extremely complex, with observed data that take 
a form not amenable to analysis via standard methods of statistical inference. To utilize such data, it is crucial to 
encode them in a simpler, reduced form. The most obvious strategy is to hand-pick a subset of attributes based on prior 
domain knowledge. For example, ratios of known emission lines in galaxy spectra may aid in the classification of low- 
redshift galaxies into starburst, active galactic nuclei, and passive galaxies. In astrophysical data analysis, a widely 
used technique for statistical learning is template fitting, where observed data are compared with sets of simulated or 
empirical data from systems with known properties; see e.g., ( |Bailer-Jones| 12010) |Dahlen et al.||2(310[|Hayden et al.| 
2010 Sesar et al. 2010| l for some recent template-based work in a variety of astrophysical contexts. Another common 



data mining approach is principal component analysis (PCA), a globally linear projection method that finds directions 
of maximum variance; see, e.g., ( Richards et al.|2009a and references therein; [Boroson and Lauer|2010| l. 

Despite their wide popularity in astrophysical data analysis, the above strategies to statistical learning all have 
obvious draw-backs: When handpicking a few attributes, one may discard potentially useful information in the data. 
For template fitting, the final estimates depend strongly on the particular selection of templates as well as the quality 
of each of the templates. Finally, PCA works best when the data lie in a linear subspace of the high-dimensional 
observable space, and can perform poorly when this is not the case. 

In this paper, we describe a more flexible approach to statistical learning that exploits the intrinsic (possibly non- 
linear) geometry of observable data with a minimum of assumptions. The idea is that naturally occurring data often 
have sparse structure due to constraints in the underlying physical process. In other words, the dimension d of the 



data space may be large but most of this space is empty. Spectral kernel methods, such as spectral clustering (Ng 



et al. 2001 von Luxburg| 2007| l, Laplacian maps (Belkin and Niyogi 20031, Hessian maps (Donoho and Grimes 



2003 1, and locally linear embeddings (Roweis and Saul 2000 1, analyze the data geometry by using certain differential 



operators and their corresponding eigenfunctions. These eigenfunctions provide a new coordinate system. For example, 
consider the emission spectra of astronomical objects. The original data with measurements at thousands of different 
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wavelengths are not in a form amenable to traditional statistical analysis and nonparametric regression. Fig.[T] however, 
shows a low-dimensional embedding of a sample of 2,793 SDSS galaxy spectra. The gray scale codes for redshift. The 
results indicate that by analyzing only a few dominant eigenfunctions of this highly complex data set, one can capture 
the main variability in redshift, although this quantity was not taken into account in the construction of the embedding. 
Moreover, the computed eigenfunctions are not only useful coordinates for the data. They form an orthogonal Hilbert 
basis for smooth functions of the data - a property that we utilize in Richards et al. ( 2009a[ l for redshift estimation. 




2nd Diffusion Coord. 



1st Diffusion Coord. 



Fig. 1 Embedding of a sample of 2,793 SDSS galaxy spectra using the first 3 diffusion map coordinates. The gray scale codes for redshift. 
(Reproduced from |Richards et al.|2009a| 



More generally, the central goal of spectral kernel methods can be described as follows: 

Find a transformation Z = f (X) such that the structure of the distribution Pz is simpler than the structure of the distribution 
Px while preserving key geometric properties of Px- 

"Simpler" can mean lower dimensional but can also be interpreted much more broadly. For example, for redshift pre- 
diction using photometric data ( [Freeman et al.[[2009] l, we transform the original 16 variables (for magnitude differences 
between five broad wavelength bandpasses, as measured using four different magnitude systems) to a 150-dimensional 
space. For the transformed data, we then fit an additive model of the form 

;=1 
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where Y denotes observed redshift, x is the original data object (galaxy), yfi{x) is the /:th coordinate after the transfor- 
mation, and e is some random noise. 



In this work, we focus on one particular non-linear data transformation called diffusion maps (Coifman et al. 2005 



Lafon and Lee 2006 1, which is an approach to spectral connectivity analysis (SCA; Lee and Wasserman|2010 1. SCA 



analyzes the higher-order connectivity of the data by defining a Markov process on a graph, where each graph node is 
an observable object, such as a spectrum, galaxy image, or set of light curves for a supernova, etc. The data are then 
transformed to a metric space where distances reflect the connectivity of the data. In Sec.|2] we describe the method. In 
Sec. [3] we give examples of some applications in astronomy. Finally, in Sec.[4j we discuss computational and statistical 
challenges for estimation for large astronomical databases, and outline some promising future directions. 

2 Spectral Connectivity Analysis 

There are several data transformation methods that aim to find a low-dimensional embedding Z = ^{X) of the data 
while preserving key geometric properties of the data distribution Px in local neighborhoods. Examples of locality- 
preserving methods are local linear embedding, Laplacian eigenmaps, Hessian eigenmaps, local tangent space align- 
ment (LTSA; [Zhang and Zha|2002| , and diffusion maps. While the exact details vary, the optimal r-dimensional em- 
bedding (where r < d)is provided as the solution to an eigenvalue problem, where the first r eigenvectors (y/i, y/r) 
provide the new data coordinates. 

Here we elaborate on diffusion maps - a specific approach to spectral connectivity analysis; Euclidean Commute 



Time maps is a closely related SCA technique discussed in e.g., Fouss et al. (2005 1. Assume we observe data ^ohs = 
{xi, . . . ,x„}, where x e W' . The basic idea is to create a distance D{xi,Xj) that measures "connectivity" or how easily 
information "flows" from point x, to x^ in a Markov chain on the observed data. (The data "points" x, and x^ represent 
entire observable objects; for example, the full emission spectra of two astronomical objects, images of two galaxies, 
or light curves of two supernovae; D is a measure of distance between the objects.) High flow occurs in high-density 
regions, and points that are connected by many high-flow paths are close with respect to the diffusion metric. The 
transition matrix A of the Markov chain is based on a user-defined pairwise distance 4 (•, •) that is a good measure of 
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dissimilarity in local neighborhoods; a common choice is the Euclidean distance in W' but other dissimilarity measures 
that incorporate prior knowledge and measurement errors can also be used. We define the transition probability from 
X; to x; in one step by A(x;,x,) ~ exp( A{x,.xj)/£) ^^^j.^ £ > is a tuning parameter that determines the local 
neighborhood size. Let A,(x,,Xy) denote the f-step transition probability; the parameter t determines the amount of 
smoothing along high-density regions and the "scale" of the analysis. The diffusion distance between points x, and Xy 
is defined as 

where the sum is over all points z in the data set ^obs^ and (po{z) is the stationary distribution of the Markov chain as 
f — > oo. In practice, we never explicitly implement the Markov chain but instead solve an eigenproblem for an n-by-n 
matrix. Let Ai > A2 > . . . and {y/,} be the eigenvalues and corresponding right eigenvectors of the 1-step transition 
matrix A. The diffusion map "f? : <!?^bs (where r < n) is given by 

'f^(x) = (A;v/i(x),...,A>,(x)). (2) 



As shown in 



I Coifman and Lafon (2006 1 and Lafon & Lee (2006), it holds that D,(xi,xj) w ||'f^,(x,) - fr(xy)||, i.e., the 
Euclidean distance in the new coordinate system approximates the diffusion distance in the original coordinate system. 
Because all connections between data points are simultaneously considered, diffusion maps are robust to noise and 
outliers and they return embeddings where metrics are explicitly defined. 

Incorporating data geometry via f and SCA can lead to radically improved inference algorithms. For details on 
the statistical properties of SCA refer to |Lee and Wasserman| ( [20T0l l. In Sec. [3] we give examples of some specific 
applications in astronomy. 

Out-of-Sample Extensions of Empirical Data Sets 

Let ^ denote the space of all data. One can show that the random walk and the eigenvectors {y/j} derived from the 
finite set <%bs have meaningful limits as the sample size n -> 00. Hence, we can think of the eigenvectors of the discrete 
random walk as estimates of eigenfunctions {\//,(x)}jgN, defined on at the o/jservec/ values Xi, . . . ,x„. We estimate 
the function Yji'^) values of x not corresponding to one of the x, 's in the data set by the kernel-smoothed estimate 
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1=1 

where A(x,x,.) = ^^^^^.y This expression is known in the appUed mathematics Hterature as the Nystrom 
approximation. These out-of-sample extensions allow us to make predictions for new data points that are not in the 
sample using diffusion maps and, for example, adaptive regression (as in Sec. |3.1| l. 

3 Applications in Astronomy 

In this section, we give some examples of applications of SCA to astrophysical problems. Among other things, dif- 
fusion maps can be used to estimate parameters in a regression framework, build classification models, and select 
prototypes for parameter estimation in complex models. The details are described in separate papers. 



3.1 Adaptive Regression and Redshift Estimation 



In Richards et al. ( 2009a i, we show how one can take advantage of the underlying data structure in non-parametric 
regression such as redshift prediction. The main idea is to describe the intrinsic data geometry in terms of fundamental 
eigenmodes. These eigenmodes can be viewed both as (i) coordinates of the data, as in Fig.[T] and as (ii) orthogonal 
basis functions for curve estimation. The latter insight can be used to develop a general regression framework for 
sparse, complex data. 

Let ^ denote the space of all observed data. In regression, the goal is to predict a real-valued function /(x) 
for data x e ^ , when given a sample of known pairs (x,F) where the response y ==/(x)+e. If /eL2(^) and 
{y/i , 1/2, • • • , } is an orthonormal basis, then we can write 

oo 

/(X)=^/3,T;(X), 

;=i 

where the expansion coefficients j3j = / f{x)\ifj{x)dx. The standard approach in non-parametric curve estima- 
tion (Efromovich 1999 1 is to choose a fixed known basis (e.g., Fourier or wavelet bases) for, for example, L?{[0, 1]), 



and then extend the basis to two or three dimensions by a tensor product. Such an approach quickly becomes intractable 
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in higher dimensions. In astrophysical problems, the response Y may be the redshift, age or metalHcity of galaxies, 
and X is often a high-dimensional, non-standard data object, such as the emission spectrum measured at p > 1000 
wavelength bins, or photometry data in a color space with p > 10 dimensions. 



In Richards et al. (201 la i, we suggest a new, adaptive approach to non-parametric curve estimation, which utilize 
the data-driven (orthogonal) eigenfunctions , \j/2, ■■■,} computed by PCA or spectral kernel methods. The regres- 
sion function estimate ?(x) is then given by 

where the coefficients jSj are estimated from the data, and p is a smoothing parameter determined by cross-validation 
and a mean-squared error prediction risk. The method is computationally efficient, making it appropriate for large 
databases such as the SDSS. One can use the predictions to speed up more computationally expensive estimation 
techniques by narrowing down the relevant parameter space; e.g., the redshift range or the set of templates in cross- 
correlation techniques. Adaptive regression also provides a useful tool for quickly identifying outliers; e.g., misclassi- 
fied spectra, spectra with anomalous features, etc. In Richards et al., we consider a sample of 3835 galaxy spectra from 
the SDSS database. For this data, the estimates based on eigenmodes from SCA (diffusion maps) lead to markedly 
better predictions than estimates from PCA, indicating the importance of non-linear geometries. 

The development of fast and accurate methods of photometric redshift estimation is a crucial step towards being 



able to fully utilize the data of next-generation surveys for precision cosmology. In Freeman et al. ( 2009 1, we apply 
adaptive regression and SCA to the problem of photometric redshift estimation for three different data sets: 350,738 
SDSS main sample galaxies, 29,816 SDSS luminous red galaxies, and 5,223 galaxies from DEEP2 with CFHTLS 
ugriz photometry. For computational speed, we first derive diffusion coordinates for training sets limited to about 10^ 
galaxies, and then extend these coordinates to the full data sets by the Nystrom method. The final redshift predictions 



achieve an accuracy on par with that of existing ML-based techniques, e.g., artificial neural networks ( Colhster and 



Lahav 2004 1 and ^-nearest neighbors (Ball et al. 2008 1. 
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3.2 Prototype Selection for Estimation of Star Formation History 



Parameter estimation in astronomy and cosmology often requires the use of complex physical models. In a typical 
application the mapping from the parameter space to the observed data space is built on sophisticated physical theory 



or simulation models or both. In Richards et al. ( 2009b 201 la I, we study one such scenario: the problem of estimating 
star formation history (SFH) in galaxies given SDSS high-resolution spectra. A common technique in the astronomy 



literature, called empirical population synthesis (see e.g., Cid Fernandes et al. 2001 and references within), is to model 
each galaxy as a mixture of stars from different simple stellar populations (SSPs), where an SSP is defined as a group 
of stars with the same age and metallicity. The principle behind this method is that each galaxy consists of multiple 
subpopulations of stars of different ages and compositions so that the integrated observed light from each galaxy 
is a mixture of the light contributed by each SSP. By estimating the mixture coefficient of each SSP, one can then 
reconstruct the star formation rate and composition as a function of time throughout the life of that galaxy. 



In our work, we use theoretical SSP models by Bruzual and Chariot (2003 1. For the galaxy spectra, we adopt the 
empirical population synthesis model in Cid Fernandes et al. ( 2004[ l: 

where Y;^ is the light flux at wavelength A; Xj is the normalized jth SSP spectrum; and Yj G [0, 1] is the proportion 
of luminosity contributed by the jth SSP. (The remaining model parameters describe the flux normalization and ob- 
servational noise, such as the amount of reddening due to foreground dust, spectral distortions due the movement of 
stars within the observed galaxy, etc.) We fit the signal model in Eq. |4]to observed noisy galaxy data with maximum 
likelihood estimation and MCMC. We then derive various physical parameters of interest from the SSP parameters 
(which are known) and the component weights in the signal model (which are estimated). For example: the average 
log age of the stars in a galaxy, (logf) = Y!j=i 7/logf (X^), where t{Xj) is the age of the j.th SSP; similarly, the average 
log metallicity (logZ) = YJJ^^ 7, logZ(Xy), where Z{Xj) is the metallicity of the j:th SSP. 

An important question is: How should one choose the set of SSPs? Though the parameters that define each SSP are 
continuous, optimizing the signal model over a large set of SSPs on a fine parameter grid is computationally infeasible 
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and inefficient. As we shall see, it also leads to poor statistical estimates. In Richards et al. ( 201 la[ l, we introduce a 
principled approach of choosing a small basis of SSP prototypes for optimal SFH parameter estimation. The basic idea 
is to explore the underlying geometry of the SSP observable data, and quantize the vector space and effective support 
of these model components. In addition to greater computational efficiency, we achieve better estimates of the SFH 
target parameters. In simulations, our proposed quantization method obtains a substantial improvement in estimating 
the target parameters over the common method of employing a parameter grid. The main reason for the improvement is 
that under the presence of noise, components with similar functional forms will be indistinguishable. Hence, it is more 
advantageous to choose prototypes that are approximately evenly spaced in the space of model data rather than evenly 
spaced in the parameter space. By replacing the theoretical models in each neighborhood by their local average in 
diffusion space ("Diffusion A'-means"; Figure|2|, the model quantization approach is optimal for treating degeneracies 
because it allows a slight increase in bias to achieve a large decrease in variance of the target parameter estimates. 
See Figure [3] for a plot of two SSP spectral bases with K prototypes chosen by a regular parameter grid and by our 
proposed quantization method, respectively. 



Diffusion K-means for SSP spectra, K=45, e=100 
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Fig. 2 Prototyping of SSP spectra by Diffusion AT-means. Representation of 1278 SSP spectra in 3-dimensional diffusion space. Large 
black dots denote the K = AS centroids. Individual SSPs are colored by cluster membership. The theoretical SSPs reside on a simple, 
low-dimensional manifold which is captured by the K prototypes. (Reproduced from |Richards et al.|2009b| 
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Normalized SSP spectra, CF05 
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Normalized prototype SSP spectra, Diffusion K-n 
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Fig. 3 Basis spectra for CF05 and Diffusion A'-means, colored by log/. All spectra are normalized to 1 at Ao = 4020 A. The diffusion 
AT-means basis covers the spectral range in relatively uniform increments, while the CF05 basis oversamples spectra from young stellar 
populations and undersamples the spectral range of older populations. (Reproduced from |Richards et al.|2009b^ 



3.3 Supernova Classification 



In many astronomical problems, classification is of paramount importance. For instance, one may be interested in 
determining which of a collection of light curves is associated with RR Lyrae stars, or Cepheids, etc. Depending on 
the problem, classification may be done in an unsupervised manner, to uncover hidden structure in the data, or, if at 
least some of the data labels are known, a classifier can be trained and then used to predict the classes of unlabeled 
data. 
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The next generation of survey telescopes will observe hundreds of thousands of noisy and irregular photometric 
SN light curves, from which astronomers will want to construct highly pure and efficient Type la SN samples for 



use in testing cosmological theories. In Richards et al. (201 lb i, we apply a semi-supervised approach to supernova 
classification. In the unsupervised step, we fit regression splines to each of a set of light curves, then via diffusion map 
place them in a lower-dimensional embedding space that capture the geometry of the underlying data distribution. In 
that space, we then take the supervised step of training a random forest classifier with only the labeled data, with the 
results used to classify the unlabeled data. Applied to the data of the Supernova Photometric Classification Challenge 
(Kessler et al. 2010), we achieve 96% purity and 86% efficiency when labeling the training set; for the test set, the 
figures are 56% and 48% respectively. As the sample sizes (of unlabeled and/or labeled data) increase, our semi- 
supervised approach will yield progressively more accurate classifications, in contrast to template-based approaches 
which do not benefit from larger data sets. We also explore how different spectroscopic foUowup strategies affect these 
figures, finding that deeper surveys yielding fewer labeled SNe can produce better results than shallower surveys. 
Determination of an optimal labeling strategy is an important component of active learning, a topic we will return to 
in the discussion below. 



4 Discussion and Future Directions 

In this review, we have described SCA — a statistical technique for transforming complex, data objects into a coor- 
dinate system that reveals the structure of the underlying data distribution. Such a transformation may improve the 
performance of classification, regression, clustering and parameter estimation. We have seen applications of SCA in 
redshift prediction, estimation of star formation history and photometric supernova classification. Currently, we are 
working with Chad Schafer to develop SCA as a tool for combining theoretical modeling and observational evidence 
into optimal constraints on the parameters of physical models. The idea is to map observed data (e.g., light curves of 
Type la supernova) as well as distributions for the observable data, constrained by physical theory (e.g., cosmological 
models) into a simpler encoding space. The shared representation of data and distributions is then exploited to achieve 
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optimal constraints on physical theories, in the form of set estimators on the distribution space; see Schafer's SCMA 
201 1 talk and paper for details. 

Another promising direction of SCA is semi-supervised learning (SSL), in particular, in combination with active 
learning. Suppose that we have a regression or classification problem. The typical scenario in SSL is that we have 
access to a large database of unlabeled examples (e.g., photometric data with unknown redshift), but relatively few 
labeled examples (e.g., data with spectroscopically confirmed redshift). Classical regression and classification tech- 
niques only take advantage of labeled data, but the central idea behind SSL is that one can make use of the unlabeled 



data to improve predictions; see e.g., Belkin and Niyogi (2005 1; Lafferty and Wasserman (2007 1; Singh et al. (2008 1 
for theoretical results on SSL. In our supernova classification application, we showed that learning a low-dimensional 
coordinate system using unlabeled data improves subsequent classification by trees. We also found evidence that the 
exact choice of training examples has a large effect on the results. In future work, we plan to explore whether we 
can achieve greater accuracy in classification and regression problems with fewer training labels if a so-called active 
learner is allowed to repeatedly pose queries, in the form of unlabeled data instances to be labeled by an oracle. In the 



machine learning literature, there are many variants of active learning; see, e.g.. Settles (2010 1 for a literature survey. 
All these models involve a search through the hypothesis space. Such searches and subsequent queries could poten- 
tially be better adapted to the underlying data distribution via an unsupervised technique such as SCA that exploit 
clusters and groupings in data. 

Finally, there are the computational challenges of efficiently constructing weighted graphs and performing eigen- 
computations for very large databases. We are currently exploring several solutions - most notably, fast approximate 
nearest neighborhood searches via trees, eigencomputations via streaming PCA (Budavari et al. 2009| l, very large- 



scale algebraic computations via matrix randomization (N. Halko and Tropp 201 1 1, and subsampling combined with 
Nystrom extensions to reduce the size of the distance matrix that is effectively eigendecomposed. 
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